Introduction to ML Strategy
Orthogonalization
Orthogonalization is the process of designing tuning knobs so that each one addresses a single, distinct issue. When building a deep learning system, it means working through the following goals in order, with a dedicated remedy for each:
- Fit training set well on cost function ($\approx$ human-level performance)
  - If not $\rightarrow$ train a bigger network / switch to the Adam optimization algorithm
- Fit dev set well on cost function
  - If not $\rightarrow$ use regularization / a bigger training set
- Fit test set well on cost function
  - If not $\rightarrow$ get a bigger dev set
- Perform well in the real world
  - If not $\rightarrow$ change the dev set or the cost function
Setting up the Goal
Single Number Evaluation Metric
Having a single real-number evaluation metric that tells you whether a new idea performs better or worse than the previous one lets you make faster progress.
| Classifier | Precision | Recall | $F_1 \ score$ |
|---|---|---|---|
| A | 95% | 90% | 92.4% |
| B | 98% | 85% | 91.0% |
Take cat recognition as an example. A precision of 95% for classifier A means that of all the images it labels as cats, 95% really are cats. Recall is the percentage of all real cat images that the classifier correctly recognizes as cats.
Rather than juggling two numbers, precision and recall, to pick a classifier, it is better to combine them into a single score: the harmonic mean of precision $P$ and recall $R$.
$$F_1 \ score = \frac 2{\frac 1{P} + \frac 1{R}}$$
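As a quick check of the formula, the $F_1$ scores in the table above can be reproduced directly (a minimal sketch; the function name is illustrative):

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall:
    # F1 = 2 / (1/P + 1/R)
    return 2 / (1 / precision + 1 / recall)

# Classifiers A and B from the table above
print(round(f1_score(0.95, 0.90) * 100, 1))  # -> 92.4
print(round(f1_score(0.98, 0.85) * 100, 1))  # -> 91.0
```

Note how classifier A wins on $F_1$ even though B has higher precision: the harmonic mean penalizes B's weaker recall.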
Having a well-defined dev set used to measure precision and recall, plus a single number evaluation metric can speed up iterating.
Another example: average the per-region error rates into a single number
| Algorithm | U.S. | China | India | Other | $Average$ |
|---|---|---|---|---|---|
| A | 3% | 7% | 5% | 9% | 6% |
| B | 5% | 6% | 5% | 10% | 6.5% |
| C | 2% | 3% | 4% | 5% | 3.5% |
| D | 5% | 8% | 7% | 2% | 5.5% |
| E | 4% | 5% | 2% | 4% | 3.75% |
| F | 7% | 11% | 8% | 12% | 9.5% |
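Collapsing the four regional error rates into one average makes the comparison mechanical: pick the algorithm with the lowest average error (a minimal sketch using the table's values):

```python
# Per-region error rates (%) from the table above
errors = {
    "A": [3, 7, 5, 9],
    "B": [5, 6, 5, 10],
    "C": [2, 3, 4, 5],
    "D": [5, 8, 7, 2],
    "E": [4, 5, 2, 4],
    "F": [7, 11, 8, 12],
}

# Single-number metric: average error across regions
averages = {name: sum(errs) / len(errs) for name, errs in errors.items()}

# Lower error is better, so pick the smallest average
best = min(averages, key=averages.get)
print(best, averages[best])  # -> C 3.5
```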
Satisficing and Optimizing Metric
Another cat classification example: care about the running time in addition to accuracy
| Classifier | Accuracy | Running time |
|---|---|---|
| A | 90% | 80ms |
| B | 92% | 95ms |
| C | 95% | 1,500ms |
One option is to combine accuracy and running time into an overall evaluation metric:

$cost = accuracy - 0.5 \times running\_time$

A more natural goal, however, is to choose the classifier that maximizes accuracy subject to the running time being $\le 100$ ms.

In this case, accuracy is the optimizing metric because we want to maximize it. Running time is a satisficing metric, meaning it just has to be good enough: it needs to be under $100$ ms, and beyond that we don't really care.
$N$ metrics:
- $1$ optimizing
- $(N-1)$ satisficing
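The satisficing-then-optimizing selection can be sketched in a few lines (a minimal example using the table above; names are illustrative):

```python
# (accuracy, running time in ms) for each classifier from the table
classifiers = {"A": (0.90, 80), "B": (0.92, 95), "C": (0.95, 1500)}

MAX_RUNTIME_MS = 100  # satisficing threshold

# Step 1: keep only classifiers that satisfy the running-time constraint
feasible = {name: acc for name, (acc, ms) in classifiers.items()
            if ms <= MAX_RUNTIME_MS}

# Step 2: among those, pick the one with the highest accuracy
best = max(feasible, key=feasible.get)
print(best)  # -> B
```

Classifier C is ruled out despite having the best accuracy, because it fails the satisficing constraint.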
Train / Dev / Test Distributions
A bad example of setting up dev and test set:
$\begin{aligned}
& \text{Regions:} \cr
& Dev = \begin{cases} US \cr UK \cr Other \ Europe \cr South \ America \end{cases} \quad Test = \begin{cases} India \cr China \cr Other \ Asia \cr Australia \end{cases}
\end{aligned}$
The problem with this dev/test split is that we might spend months optimizing against the dev set, only to discover that the test set comes from a very different distribution. All the work spent doing well on the dev set then fails to transfer to the test set.
Make the dev and test sets come from the same distribution!
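One simple way to guarantee this is to pool the data from all regions, shuffle it, and only then split, so both sets are drawn from the same distribution (a hypothetical helper; the function and variable names are illustrative):

```python
import random

def make_dev_test(examples, dev_fraction=0.5, seed=0):
    # Shuffle the pooled data BEFORE splitting, so dev and test
    # come from the same distribution of regions
    rng = random.Random(seed)
    pooled = list(examples)
    rng.shuffle(pooled)
    n_dev = int(len(pooled) * dev_fraction)
    return pooled[:n_dev], pooled[n_dev:]

# e.g. pool examples from every region first, then split
all_examples = [("US", i) for i in range(50)] + [("India", i) for i in range(50)]
dev, test = make_dev_test(all_examples)
```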
When to Change Dev / Test Sets and Metrics
Comparing to Human-level Performance
Machine Learning Fight Simulator